update frequency
ARelated Work
Transfer in reinforcement learning aims at solving a new target task with no additional learning or sample-efficiently by exploiting agents and information obtained from source tasks. We review a line of research with relevant approaches. This group of approaches reuses policies learned on source tasks for target tasks. Fernández and Veloso [17] suggest an exploration strategy for the learning of a new policy given a new task and learned source policies, where the gain of using each policy is estimated together on-line and one of the policies in the set is selected probabilistically at each step, based on the gain, but they focus on aiding the training of the target policy with samples from the target task rather than improving the zero-shot transfer performance. On the other hand, Dayan [14] introduce successor representations (SRs), state space occupancy representations disentangled from rewards, which allow linear decomposition of value functions.
STEM: A Stochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning
Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model by using local data. Despite extensive research, for a generic non-convex FL problem, it is not clear, how to choose the WNs' and the server's update directions, the minibatch sizes, and the local update frequency, so that the WNs use the minimum number of samples and communication rounds to achieve the desired solution. This work addresses the above question and considers a class of stochastic algorithms where the WNs perform a few local updates before communication. We show that when both the WN's and the server's directions are chosen based on certain stochastic momentum estimator, the algorithm requires $\tilde{\mathcal{O}}(\epsilon^{-3/2})$ samples and $\tilde{\mathcal{O}}(\epsilon^{-1})$ communication rounds to compute an $\epsilon$-stationary solution. To the best of our knowledge, this is the first FL algorithm that achieves such {\it near-optimal} sample and communication complexities simultaneously. Further, we show that there is a trade-off curve between local update frequencies and local minibatch sizes, on which the above sample and communication complexities can be maintained.
LLM-based Multi-Agent System for Simulating Strategic and Goal-Oriented Data Marketplaces
Sashihara, Jun, Fujita, Yukihisa, Nakamura, Kota, Kuwahara, Masahiro, Hayashi, Teruaki
Abstract--Data marketplaces, which mediate the purchase and exchange of data from third parties, have attracted growing attention for reducing the cost and effort of data collection while enabling the trading of diverse datasets. However, a systematic understanding of the interactions between market participants, data, and regulations remains limited. T o address this gap, we propose a Large Language Model-based Multi-Agent System (LLM-MAS) for data marketplaces. In our framework, buyer and seller agents powered by LLMs operate with explicit objectives and autonomously perform strategic actions, such as--planning, searching, purchasing, pricing, and updating data. These agents can reason about market dynamics, forecast future demand, and adapt their strategies accordingly. Unlike conventional model-based simulations, which are typically constrained to predefined rules, LLM-MAS supports broader and more adaptive behavior selection through natural language reasoning. We evaluated the framework via simulation experiments using three distribution-based metrics: (1) the number of purchases per dataset, (2) the number of purchases per buyer, and (3) the number of repeated purchases of the same dataset. The results demonstrate that LLM-MAS more faithfully reproduces trading patterns observed in real data marketplaces compared to traditional approaches, and further captures the emergence and evolution of market trends. Data have emerged as a tradable economic resource, and data marketplaces that mediate the purchase and exchange of datasets from third parties have rapidly expanded [1]. These marketplaces streamline data collection that previously required substantial cost and effort, while also providing organizations and researchers with access to diverse, high-quality datasets. As a result, they are increasingly recognized as critical infrastructures that accelerate innovation based on data that were closed within individual organizations [2]. Despite this progress, our understanding of how interactions among market participants, data, and regulations shape market dynamics remains limited. Smooth and efficient data transactions require well-designed and robust data marketplaces [3].
A Supplementary Material
On HuggingFace, we find information about the annotation creators ( e.g., crowdsource, experts, ml-generated) or specific task categories ( e.g., image-classification, image-to-text, text-to-image). Kaggle automatically computes a usability score, which is associated with the tag "well-documented", Kaggle's usability score is based on: Completeness: subtitle, tag, description, cover image . Credibility: provenance, public noteboook, update frequency . Compatibility: license, file format, file description, column description . The usability score is based on only 4 out of 7 aspects from Datasheets [40].
Mechanistic Insights into Grokking from the Embedding Layer
AlquBoj, H. V., AlQuabeh, Hilal, Bojkovic, Velibor, Nwadike, Munachiso, Inui, Kentaro
Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks, whereas MLPs without embeddings can generalize immediately. Our analysis identifies two key mechanisms: (1) Embedding update dynamics, where rare tokens stagnate due to sparse gradient updates and weight decay, and (2) Bilinear coupling, where the interaction between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization. To confirm these mechanisms, we investigate frequency-aware sampling, which balances token updates by minimizing gradient variance, and embedding-specific learning rates, derived from the asymmetric curvature of the bilinear loss landscape. We prove that an adaptive learning rate ratio, \(\frac{η_E}{η_W} \propto \frac{σ_{\max}(E)}{σ_{\max}(W)} \cdot \frac{f_W}{f_E}\), mitigates bilinear coupling effects, accelerating convergence. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.